Using RFC3229 with Feeds.
[Updated: 22-Sep-2004 14:33 with link to list of implementations]
The other day I wrote that we really should be adopting RFC3229 "Delta Encoding in HTTP" in order to reduce the amount of bandwidth, etc. that is wasted in serving RSS and Atom files. I'm fairly convinced that if the folk at Microsoft had been using what I propose here, they would not have been forced to take the drastic measures that they did when they did.
Of course, use of RFC3229 would have only delayed, not eliminated, the day when the current practice of polling for updates to RSS files would have become excessively expensive for Microsoft. The real solution to the bandwidth problem is to move from polling to a push-based solution. But, at least by implementing RFC3229, we can take the polling solution just about as far as it can be taken in terms of efficiency. This is a good intermediate step on the way to the push-based solutions that we won't have much choice but to implement as the audience for RSS and Atom data grows.
This post is intended to provide additional detail on what I'm proposing. It is my intention to create an Internet Draft describing the ideas here once I've had reasonable time to receive comments from folk and work out the inevitable bugs. Please feel free to comment on what is below:
Feeds aren't like HTML
Atom and RSS files are members of a distinct class of files that we call "feeds." Conceptually, a "feed" is a potentially infinitely long and growing series of items or entries. In order to reduce the cost of distributing new entries, RSS and Atom feeds are typically implemented as "sliding window" feeds. Such feeds don't contain every entry that has ever been published. Rather, they only contain some number of the most recent changes to the feed.
Common practice today is for feed providers to establish a certain fixed "window size" which defines the maximum number of entries that are contained in any instance of the feed file. Once the maximum number of entries has been reached, then every retrieval of the feed from then on will always receive that number of entries -- even if some smaller number of entries has been inserted into the feed since the last time the feed was retrieved by a specific client. The result is a great deal of wasted bandwidth and processing resource.
In order to allow the number of entries returned in a feed to be no more than the total number of new or modified entries inserted into the feed since the last time any specific client retrieved the feed, I propose that we rely on RFC3229 "Delta encoding in HTTP" with a new instance-manipulation method defined to provide feed-specific delta encoding.
The "feed" instance-manipulation method
The "feed" instance manipulation method is an abstract method for which concrete forms can be defined for use with various content types. In this document, I'll be speaking primarily about the use of Feed IM as appropriate for Atom files.
Unlike the IM methods currently registered for use with RFC3229, the "feed" IM method is not byte-oriented. Rather, it is item or entry oriented in that deltas are computed not in bytes but rather in whole items or entries. The definition of those items or entries is dependent on the underlying content type. For instance, if the content type is RSS, the delta unit is an "item". If the content type is Atom, the delta unit is an "entry". If the content type is "log file" then the delta unit is a "line".
When the "feed" IM method is applied to an instance, the result should conform to whatever the syntactic requirements are for the type of the instance. Thus, if the instance is Atom formatted, the result of applying the "feed" IM method would be Atom-conformant. This implies that the result would have atom:entry elements which would be wrapped in an atom:feed element that contained an atom:head element.
The detailed rules for applying "feed" instance-manipulation for various types should be easily derived from what is said above.
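As a rough illustration of what an entry-oriented (rather than byte-oriented) delta might look like, here is a minimal Python sketch. The `Entry` class, its `version` field, and the `feed_delta` function are hypothetical names invented for this example; the post does not say how a server tracks which entries changed, so a simple per-feed revision counter is assumed. Serialising the result back into a conformant feed document (for Atom, the wrapping atom:feed element) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    entry_id: str  # assumed stable identifier for the item or entry
    version: int   # assumed feed revision at which the entry was added or last modified
    body: str


def feed_delta(entries, since_version):
    """Return the entries added or modified after since_version, newest first.

    The unit of the delta is a whole entry, never a byte range.
    """
    changed = [e for e in entries if e.version > since_version]
    return sorted(changed, key=lambda e: e.version, reverse=True)


if __name__ == "__main__":
    feed = [Entry("a", 1, "first"), Entry("b", 5, "edited"), Entry("c", 3, "third")]
    for entry in feed_delta(feed, since_version=2):
        print(entry.entry_id, entry.version)
```

A client that last saw revision 2 would receive only entries "b" and "c" here, regardless of how large the feed's window is.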
The requirement that applying the "feed" IM method to an instance yields a result of the same type as the instance presents an interesting opportunity. Byte-oriented IM methods must never be used unless specifically requested by a client, since not every client may support the result or have the history needed to interpret it. This requirement need not exist for the feed IM method. Thus, servers that implement this method are free to apply it by default even if it is not requested.
feed: A worked example
The following shows what an RFC3229-compliant request to a server might look like:
GET /atom.xml HTTP/1.1
Host: bar.example.net
If-None-Match: "321"
A-IM: feed, gzip
- The client wants to obtain the current value of /atom.xml
- It has previously received an instance whose entity tag is "321"
- It is willing to accept delta-encoded updates using the "feed" IM method. (Note: It is not strictly necessary for the client to request the "feed" IM method in all cases since some servers may actually apply this method by default. Nonetheless, it is good form to request it since some servers may not use the method unless it is requested.)
- It is willing to accept responses that have been compressed using "gzip," whether or not these responses have been delta-encoded.
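On the client side, the request headers described above could be assembled along these lines. This is only a sketch: `build_feed_request_headers` and its parameters are hypothetical names, and it produces a plain dictionary that could be handed to whatever HTTP library the client uses.

```python
def build_feed_request_headers(last_etag=None, accept_gzip=True):
    """Build the conditional, delta-aware headers for polling a feed.

    last_etag is the entity tag (quotes included) saved from the previous
    response, or None on the first fetch.
    """
    headers = {}
    if last_etag:
        # Tell the server which instance we already hold.
        headers["If-None-Match"] = last_etag
    # Advertise the instance manipulations we are willing to accept.
    methods = ["feed"]
    if accept_gzip:
        methods.append("gzip")
    headers["A-IM"] = ", ".join(methods)
    return headers


if __name__ == "__main__":
    print(build_feed_request_headers('"321"'))
```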
If, when this request is received, the server's current entity tag for the resource is still "321," then the server should simply return a 304 (not modified) response, as would a traditional server.
If the entity tag has changed, the server could compute the delta between the entity whose entity tag was "321" and the current instance. If the server no longer knows what the "321" entity tag corresponds to, it would probably send the entire feed.
If the client requests delta-encoding but the server doesn't support this form of instance manipulation, the server will simply ignore this aspect of the request.
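The three outcomes just described (304, a delta, or the full feed) can be sketched as a small decision function. Everything here is an assumption made for illustration: the name `choose_response`, the idea that the server keeps a set of entity tags it can still compute deltas against, and the conservative choice to send a delta only when the client asked for "feed" (a server could also apply it by default). Compression is ignored.

```python
def choose_response(current_etag, if_none_match, a_im, known_etags,
                    supports_feed_im=True):
    """Return (status_code, im_header_value) for a feed request.

    known_etags is the set of older entity tags the server can still
    compute a delta from (an assumed bookkeeping structure).
    """
    if if_none_match == current_etag:
        return 304, None  # nothing changed since the client's copy
    # Parse A-IM into method names, dropping any parameters such as q-values.
    requested = []
    if a_im:
        requested = [m.split(";")[0].strip().lower() for m in a_im.split(",")]
    if supports_feed_im and "feed" in requested and if_none_match in known_etags:
        return 226, "feed"  # send only the new or modified entries
    return 200, None  # fall back to the entire feed


if __name__ == "__main__":
    print(choose_response('"4321"', '"321"', "feed, gzip", {'"321"'}))
```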
If the server responds with a delta encoded response, it would look something like this:
HTTP/1.1 226 IM Used
ETag: "4321"
IM: feed, gzip
Date: Mon, 13 Sep 2004 18:30:05 GMT
Cache-Control: no-store, im
...
- The response status is 226 IM Used -- a success code.
- The entity tag given is that of the new state of the resource.
- The response carries an "IM" response-header field, indicating which delta encoding is used in the response.
- The Cache-control "no-store" is used to ensure that caches that do not understand delta-encoding do not cache this response. However, a cache that does understand the use of instance-manipulation is allowed to ignore the "no-store" directive which would otherwise be mandatory.
- The message-body is first delta-encoded using the feed IM method appropriate for the type of feed and is then gzipped.
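A client receiving such a response needs to fold the delta into what it already holds. The following is a sketch under stated assumptions: `apply_response` is a hypothetical helper, the body is taken to be already un-gzipped and parsed into a mapping from entry id to entry content, and a full (non-delta) response simply replaces the client's copy.

```python
def apply_response(status, im_header, local_entries, received_entries):
    """Return the client's new view of the feed as a dict of id -> content."""
    if status == 304:
        return dict(local_entries)  # not modified; keep what we have
    methods = []
    if im_header:
        methods = [m.strip().lower() for m in im_header.split(",")]
    if status == 226 and "feed" in methods:
        # Delta: add new entries and overwrite modified ones.
        merged = dict(local_entries)
        merged.update(received_entries)
        return merged
    if status in (200, 226):
        # Full instance: the response is the whole feed.
        return dict(received_entries)
    raise ValueError(f"unexpected status {status}")


if __name__ == "__main__":
    local = {"a": "old a", "b": "old b"}
    print(apply_response(226, "feed, gzip", local, {"b": "new b", "c": "new c"}))
```

The new entity tag from the response would be stored alongside the merged entries and sent as If-None-Match on the next poll.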
For a list of implementations of RFC3229 with the "feed" IM method, click here. For statistics showing the savings that have resulted from early implementations, click here.
f-range: A feed oriented Range
Just as it is appropriate to define a feed specific delta IM method, it is appropriate to provide a feed-specific IM method for range selection. RFC3229 currently only supports byte-oriented range selection.
The f-range IM method uses the content type's concept of item or entry as its unit of selection. Thus, "F-range: entries=1-20" would specify that the client only wanted to receive a maximum of 20 items starting at the "first" or most recent item in the feed. The specification "F-range: entries=20-" would indicate that all items, starting at the 20th oldest, should be returned in the result. All item offsets should be computed based on the state of the feed associated with the entity tag passed in the request or the If-F-Range if provided. Thus, it is possible for limited resource clients to "chunk" their way through a large number of available items in a fast moving feed.
When responding to an F-range request, the response should contain the entity tag associated with the feed at the time of the response and the cache control statements should be set to prevent caching.
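To make the selection rule concrete, here is a small sketch of parsing an F-range value and applying it. The function names are hypothetical, and the sketch assumes positions are 1-based and counted from the most recent entry, which is how "entries=1-20" is described above; the open-ended form could be read differently, so treat that part as one possible interpretation.

```python
import re

# Accepts "entries=1-20" (closed range) or "entries=20-" (open-ended).
_F_RANGE_RE = re.compile(r"^entries=(\d+)-(\d*)$")


def parse_f_range(value):
    """Parse an F-range value into (first, last); last is None if open-ended."""
    match = _F_RANGE_RE.match(value.strip())
    if not match:
        raise ValueError(f"malformed F-range value: {value!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else None
    if first < 1 or (last is not None and last < first):
        raise ValueError(f"invalid F-range bounds: {value!r}")
    return first, last


def select_entries(entries_newest_first, value):
    """Return the slice of entries named by the F-range value."""
    first, last = parse_f_range(value)
    # Positions are 1-based and inclusive; a None end means "to the end".
    return entries_newest_first[first - 1:last]


if __name__ == "__main__":
    feed = [f"entry-{i}" for i in range(1, 31)]
    print(len(select_entries(feed, "entries=1-20")))
```

A resource-limited client could request "entries=1-20", then "entries=21-40", and so on, against the same entity tag to work through a large backlog in chunks.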
f-range: A worked example
GET /atom.xml HTTP/1.1
Host: bar.example.net
If-None-Match: "321"
A-IM: feed, f-range, gzip
F-range: entries=1-20
- This request asks first for all entries added since the entity tag "321"
- The set of items in the response is limited to the most recent 20
- The response should be gzipped.
HTTP/1.1 226 IM Used
ETag: "4321"
IM: feed, f-range, gzip
Date: Mon, 13 Sep 2004 18:30:05 GMT
Cache-Control: no-store
...
Benefits of the approach
Implementing and deploying the "feed" IM method will provide the same general benefits as are provided by the various byte-oriented IM methods of RFC3229. These benefits are:
- A reduction in the mean size of HTTP responses, thereby improving latency and network utilization. For actual numbers which show savings from early implementations of RFC3229+feed, click here.
- Avoidance of any extra network round trips
- Minimization of per-request and per-response overheads.
- Support for a variety of encoding algorithms and formats.
- Interoperation with HTTP/1.0 and HTTP/1.1.
- Fully optional for clients, proxies, and servers.
- Moderately simple implementations are possible.
Why not use Vary to indicate the additional headers which provide a complete caching context?
With respect to Range, I've just written up the use of this HTTP header for addressing and direct manipulation of subresources using the XPointer Framework for an extensible addressing platform. Since the conference (Extreme Markup 2004) has not yet (!?!) published the paper online, I'll send you a copy in email.
By the way, I did a similar thing with REST-ful queues where the parameters were in the query string. Worked fine. I used it to back an RSS service so that you could get slices of the RSS feed based on any indexed properties from the queue entries. I've been meaning to revisit this using the Range header.
-bryan
Posted by: bryan | September 13, 2004 at 08:06
James Robinson reports that he has implemented diffe-based RFC3229 support for WordPress. Hopefully, he'll support the "feed" IM method as well... see:
http://www.robinsonhouse.com/2004/09/14/rss-and-delta-encoding/
bob wyman
Posted by: Bob Wyman | September 14, 2004 at 19:59
Bob - I just implemented RFC3229 for ExpressionEngine this morning allowing both diffe and feed, and I am wondering if the Apache support is really the only thing holding this back now?
Posted by: Paul Burdick | September 16, 2004 at 13:44
People here should be aware that ICE (Information & Content Exchange) is an XML-based syndication standard that has been providing incremental updates since 1998 or so. There are some real limitations to presenting "deltas" at the HTTP layer rather than at the application layer, where (IMO) the semantics are more appropriate. That is, HTTP offsets are really intended to be byte offsets into the content of the resource being delivered via HTTP, while at the application (ICE) level it makes sense to have instructions such as "here's a new version of story 123" or "delete story 456, and add story 789".
For info on ICE2, see http://www.icestandard.org.
Posted by: Laird Popkin | September 21, 2004 at 16:51
Is there any way to indicate that an entry has been deleted? Would there need to be?
Posted by: Winter | September 24, 2004 at 16:29
But hasn't the problem of bandwidth already been solved in HTTP? There are many headers in the HTTP specification that help with this. Speaking as a blogging hoster who supplies GBs of RSS feeds each month, it amazes me how many hosts/feedreaders do not even attempt to present Last-Modified or ETag data to determine if they have the latest information already. If they were to do this at least, then the amount of bandwidth consumed would drop like a stone overnight.
So while I applaud this effort, I can't see it solving the problem. It is just another layer of administration that feedreaders will not implement. It's laziness and an inability to appreciate that HTTP has already answered many of the problems facing our RSS world today. Previously it was easy; it was largely down to the browser engineers, and since there was a finite number of them, they all adhered to the standard. However, since there are so many different RSS readers out there, including all the RSS developer APIs for reading and parsing feeds, no one is bothering to think about the poor HTTP protocol wire it's eventually going out on. So when a developer creates his new RSS reader, it's not his bandwidth that is being consumed, but instead that of the countless clients of his creation. Why should he be worried about the bandwidth?
Therefore, instead of actually using the tools and headers available, we find ourselves having to create more standards to solve the problem. I don't think it's the way to go.
Posted by: Alan Williamson | December 02, 2004 at 18:16
Alan, certainly we would all be better off if clients actually implemented conditional GETs with "If-Modified-Since" or ETags. However, even if they did, we would still be wasting bandwidth in the case where some of a feed has changed but not all of it. In terms of priority, supporting conditional GETs is certainly higher than RFC3229+feed. However, once you've provided support for conditional GETs, the incremental effort to support RFC3229+feed is trivial while the benefits are great. As I've documented elsewhere on this blog, we saw a massive drop in bandwidth needs the moment we implemented RFC3229+feed ourselves. This is because many of the more popular feed readers *have* built in support for RFC3229+feed. If you were to implement this at blog-city, I think you would also find massive improvements -- even though there are still many clients that don't support it.
bob wyman
Posted by: Bob Wyman | December 02, 2004 at 18:48
Hi,
Can I use delta-encoding in cache-nocache multipart messages?
How to apply delta-encoding inside an html page?
Thanks a lot
Posted by: Andres | February 13, 2006 at 11:29